This is the fourth project of Udacity's Data Analyst Nanodegree program. We were given several data sources to choose from; I chose Arizona's 2016 presidential campaign finance data from the Federal Election Commission (FEC) website.
The format of this analysis will be as follows:
1- State my intentions, with a hypothesis where applicable
2- Insert an R snippet and run it
3- Report my findings
And so on…
First I will begin the analysis by exploring basic statistics about the data set. This will show me the nature of the data and whether it needs cleaning or wrangling. Afterwards, I will explore univariate and multivariate relationships using the methods I learned in Chapter 4, such as scatter, line and box plots and histograms. This is the basic outline of the analysis, but I will surely find interesting things to discuss along the way.
# import data and packages
az <- read.csv('P-AZ1')
az$x <- NULL
library(ggplot2)
library(plyr)
library(dplyr)
library(gridExtra)
library(memisc)
library(reshape2)
library(gender)
library(stringr)
require(magrittr)
require(zipcode)
require(tmap)
require(glue)
require(githubinstall)
# Install choroplethrZip
#install.packages("devtools")
#library(devtools)
#install_github('arilamstein/choroplethrZip@v1.5.0')
x <- c("ggmap", "rgdal", "rgeos", "maptools", "dplyr", "tidyr", "tmap")
lapply(x, library, character.only = TRUE) # load the required packages
# convert zip code to factor
az$contbr_zip <- factor(az$contbr_zip)
#convert from char to date class
az$proper_date <- as.Date(az$contb_receipt_dt, format = '%d-%B-%y')
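One caveat worth checking: as.Date with the format '%d-%B-%y' is locale-dependent, and whether '%B' matches abbreviated month names like OCT varies by platform. A quick sanity check, as a sketch:

```r
# Sanity check (a sketch): count receipt dates that failed to parse.
# A non-zero count would suggest switching '%B' to '%b'.
sum(is.na(az$proper_date) & !is.na(az$contb_receipt_dt))
```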
The structure of the data is as follows: the file has 19 variables, and these are the most important ones for the analysis.
I wish there were a party and a gender column; I will try to add them below.
# Add party col
# Note code template was taken from Udacity Forums
index <- c("Johnson, Gary", "Stein, Jill", "McMullin, Evan")
dindex <- c("Clinton, Hillary Rodham", "Sanders, Bernard", "Lessig, Lawrence", "O'Malley, Martin Joseph", "Webb, James Henry Jr.")
rindex <- c('Bush, Jeb', "Carson, Benjamin S."
, "Christie, Christopher J", "Cruz, Rafael Edward 'Ted'",
"Fiorina, Carly", "Gilmore, James S III" ,
"Graham, Lindsey O.", "Huckabee, Mike",
"Jindal, Bobby", "Kasich, John R.",
"Paul, Rand", "Perry, James R. (Rick)",
"Rubio, Marco", "Trump, Donald J.",
"Walker, Scott" )
az$party[az$cand_nm %in% index] <- "independent"
az$party[az$cand_nm %in% dindex] <- "democrat"
az$party[az$cand_nm %in% rindex] <- "republican"
# Convert party to factor
az$party <- factor(az$party)
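Candidates that appear in none of the three vectors are left as NA; Santorum, for instance, is missing from rindex. A quick check, sketched below:

```r
# Candidates whose party is still NA after the lookup above
unique(az$cand_nm[is.na(az$party)])
```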
I would also like to add other information, such as latitudes and longitudes, for map analysis.
data(zipcode)
az_loc <- subset(zipcode, state == 'AZ')
az$clean_zip <- substring(az$contbr_zip, 1, 5)
az <- merge(az, az_loc, by.x = 'clean_zip', by.y = 'zip')
I would also like to integrate population data by zip-code from the 2010 ZCTA census.
# Add Gender col to candidates
# Gender indices
m_index <- c('Bush, Jeb', "Carson, Benjamin S."
, "Christie, Christopher J", "Cruz, Rafael Edward 'Ted'",
"Gilmore, James S III" ,
"Graham, Lindsey O.", "Huckabee, Mike",
"Jindal, Bobby", "Kasich, John R.",
"Paul, Rand", "Perry, James R. (Rick)",
"Rubio, Marco", "Trump, Donald J.",
"Walker, Scott", "Sanders, Bernard",
"Lessig, Lawrence", "O'Malley, Martin Joseph",
"Webb, James Henry Jr.", "Johnson, Gary",
"McMullin, Evan"
)
f_index <- c("Clinton, Hillary Rodham", "Fiorina, Carly", "Stein, Jill" )
# simple cand_gender
az$cand_gender <- NA
az$cand_gender[az$cand_nm %in% m_index] <- "Male"
az$cand_gender[az$cand_nm %in% f_index] <- "Female"
# convert cand_gender to factor
az$cand_gender <- factor(az$cand_gender)
Now that I have added the candidates' genders, I'll add the contributors' genders using the gender package.
# contributors' genders
# Get first names in a separate col
az$first_name <- str_split_fixed(az$contbr_nm, ", ", 2)[,2]
# Use gender function
gender_df <- gender(as.character(az$first_name), c(1932, 1998),
countries= "United States")
# Assign gender to contributors in az df
names(gender_df)[1] = "first_name"
names(gender_df)[4] = 'contrib_gender'
gender_df <- unique(gender_df)
az <- merge(az, gender_df[ c("first_name", "contrib_gender")])
# convert contrib_gender to factor
az$contrib_gender <- factor(az$contrib_gender)
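One caveat: merge() defaults to an inner join, so contributors whose first name gets no prediction from gender() are dropped from az entirely. If keeping those rows (with an NA gender) is preferable, a left join would do it, sketched below:

```r
# Left join instead: keep every row of az, filling contrib_gender
# with NA where the first name has no gender prediction
az <- merge(az, gender_df[, c("first_name", "contrib_gender")],
            by = "first_name", all.x = TRUE)
```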
# Create a function that gets top ten counts
top10 <- function(x){
y <-table(x)
y <- sort(y, decreasing = T)
y <- as.data.frame(y)
y[1:10,]
}
# count city occurrences in the data set
city_tab <- top10(az$contbr_city)
ggplot(city_tab, aes(x, Freq))+
geom_bar(stat='identity')
# top party counts
top_parties <-top10(az$party)
ggplot(top_parties, aes(x, Freq))+
geom_bar(stat='identity')
## Warning: Removed 7 rows containing missing values (position_stack).
city_tab
I would like to know how the data is distributed.
# money distribution
ggplot(az, aes(contb_receipt_amt))+
geom_histogram(binwidth = 15)+
coord_cartesian(xlim=c(0:500))
# money distribution, normalized
ggplot(az, aes(contb_receipt_amt))+
geom_histogram()+
scale_x_log10()
## Warning in self$trans$transform(x): NaNs produced
## Warning: Transformation introduced infinite values in continuous x-axis
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1513 rows containing non-finite values (stat_bin).
ggplot(az, aes(contb_receipt_amt))+
geom_freqpoly()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Date distribution
# Date distribution
ggplot(az, aes(proper_date))+
geom_histogram(binwidth = 20)
It appears that we have negative amounts, going as low as -5,400. These most likely represent refunds, judging by the receipt descriptions.
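To check the refund hypothesis directly, one could tabulate the receipt descriptions attached to the negative amounts (a sketch, assuming the receipt_desc column from the FEC file):

```r
# Most common receipt descriptions among negative amounts
head(sort(table(az$receipt_desc[az$contb_receipt_amt < 0]),
          decreasing = TRUE))
```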
I want to find out the amount stats without the refunds.
# most popular contribution denominations
summary(subset(az$contb_receipt_amt, az$contb_receipt_amt > 0 ))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04 15.00 27.00 80.21 61.86 10800.00
non_zero_rec <- subset(az$contb_receipt_amt, az$contb_receipt_amt > 0 )
tab <- table(non_zero_rec)
str(tab)
## 'table' int [1:2224(1d)] 1 1 2 2 1 25 1 861 1 1 ...
## - attr(*, "dimnames")=List of 1
## ..$ non_zero_rec: chr [1:2224] "0.04" "0.12" "0.24" "0.5" ...
tab[tab>1000]
## non_zero_rec
## 3 5 8 10 15 19 20 25 27 28 35 38
## 1126 7727 1869 11486 5699 2539 2638 17543 5432 2258 2322 1014
## 40 50 75 80 100 200 250 500
## 2340 13346 1325 2038 11547 1941 4095 1380
top_deno<- top10(az$contb_receipt_amt)
ggplot(top_deno, aes(x, Freq))+
geom_bar(stat='identity')
I wonder why odd denominations such as 19, 27 or even 38 appear so often.
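One way to probe the odd denominations is to see which candidates they belong to; $27, for instance, was reportedly a widely publicized suggested amount in the Sanders campaign. A sketch:

```r
# Cross-tabulate the odd denominations against candidates
with(subset(az, contb_receipt_amt %in% c(19, 27, 38)),
     table(droplevels(cand_nm), contb_receipt_amt))
```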
Below we will see the number of contributions for each candidate and how they break down:
# Most count of cont.
sort(summary(az$cand_nm), decreasing= T)
## Clinton, Hillary Rodham Sanders, Bernard
## 53861 35784
## Trump, Donald J. Cruz, Rafael Edward 'Ted'
## 16087 7129
## Carson, Benjamin S. Rubio, Marco
## 2954 1657
## Fiorina, Carly Paul, Rand
## 462 426
## Johnson, Gary Kasich, John R.
## 318 263
## Stein, Jill Bush, Jeb
## 199 122
## Huckabee, Mike McMullin, Evan
## 101 98
## Walker, Scott O'Malley, Martin Joseph
## 95 29
## Christie, Christopher J. Santorum, Richard J.
## 19 19
## Jindal, Bobby Graham, Lindsey O.
## 10 9
## Webb, James Henry Jr. Lessig, Lawrence
## 5 4
## Perry, James R. (Rick) Gilmore, James S III
## 1 0
# top cand in terms of money
cand_groups <- group_by(az, cand_nm)
cand_sum <-summarize(cand_groups,
mean(contb_receipt_amt),
n = n())
tot_rec <- aggregate(az$contb_receipt_amt, list(az$cand_nm), sum)
tot_rec <- arrange(tot_rec, desc(x))
tot_rec
# Top candidates by money received
ggplot(data= tot_rec,
aes(reorder(Group.1, -x), x)) +
geom_bar(stat="identity")
# Normalized
ggplot(data= tot_rec,
aes(reorder(Group.1, -x), x)) +
geom_bar(stat="identity")+
scale_y_log10()
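The long candidate names overlap on the x axis of these bar charts; flipping the coordinates is one easy fix, sketched below:

```r
# Horizontal bars: candidate names become readable y-axis labels
ggplot(tot_rec, aes(reorder(Group.1, x), x)) +
  geom_bar(stat = "identity") +
  coord_flip()
```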
I would like to see the box-plot of each gender/party contribution.
ggplot(az[!is.na(az$party) & az$party != 'independent',], aes(x=contrib_gender, y=contb_receipt_amt))+
geom_boxplot(varwidth=T)+
facet_grid(~party)+
coord_cartesian(ylim=boxplot.stats(az$contb_receipt_amt)$stats[c(1, 5)]
)
Republicans contributed more on average, and they had a higher range of contribution amounts. Male republicans contributed slightly more on average than their female counterparts.
table(az$contrib_gender)
##
## female male
## 65569 54083
I did not expect to find more female contributors than males in this data-set.
Let's explore whether females were more likely to contribute to female candidates.
same_sexf<- subset(az, az$contrib_gender == 'female' & az$cand_gender== 'Female')
length(same_sexf$contrib_gender)/length(az$contrib_gender[az$contrib_gender == 'female'])
## [1] 0.546188
same_sexm<- subset(az, az$contrib_gender == 'male' & az$cand_gender== 'Male')
length(same_sexm$contrib_gender)/ length(az$contrib_gender[az$contrib_gender == 'male'])
## [1] 0.653514
ggplot(az[!is.na(az$contrib_gender) & !is.na(az$cand_gender),] , aes(cand_gender, contrib_gender))+
geom_bin2d()
54.6% of the female contributors in this data-set contributed to female candidates, while 65.4% of the male contributors contributed to male candidates, which suggests only a modest same-gender preference.
# Create the mode function.
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
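For example, applied to a small vector:

```r
getmode(c("ATTORNEY", "TEACHER", "ATTORNEY"))  # returns "ATTORNEY"
```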
# distinct employee contributions from each employer
employer <- summary(az$contbr_employer)
# most generous occupations and their share of the total
#subset academia employees
academia <- subset(az, grepl('UNIVERSITY|COLLEGE', az$contbr_employer, ignore.case = T))
# group by university/college
cand_acad <- group_by(academia, contbr_employer)
# average amount contributed by academia
spend_acad <- summarise(cand_acad,
number_of_contributions = n(),
avg_spent = mean(contb_receipt_amt),
main_cand = getmode(cand_nm),
party = getmode(party)
)
# sort by number of contributions
spend_acad <- spend_acad[order(-spend_acad$number_of_contributions), ]
spend_acad
# percentage of demo/repub colleges
percent(spend_acad$party)
## democrat independent republican N
## 80.341880 1.709402 17.948718 117.000000
# Percentage of candidate preference
sort(percent(academia$cand_nm)[1:24],decreasing = T)
## Clinton, Hillary Rodham Sanders, Bernard
## 59.54716981 36.50314465
## Cruz, Rafael Edward 'Ted' Trump, Donald J.
## 1.43396226 1.40880503
## Carson, Benjamin S. Rubio, Marco
## 0.42767296 0.20125786
## Stein, Jill Paul, Rand
## 0.20125786 0.15094340
## Graham, Lindsey O. Fiorina, Carly
## 0.05031447 0.02515723
## Johnson, Gary Kasich, John R.
## 0.02515723 0.02515723
## Bush, Jeb Christie, Christopher J.
## 0.00000000 0.00000000
## Gilmore, James S III Huckabee, Mike
## 0.00000000 0.00000000
## Jindal, Bobby Lessig, Lawrence
## 0.00000000 0.00000000
## McMullin, Evan O'Malley, Martin Joseph
## 0.00000000 0.00000000
## Perry, James R. (Rick) Santorum, Richard J.
## 0.00000000 0.00000000
## Walker, Scott Webb, James Henry Jr.
## 0.00000000 0.00000000
Around 80% of colleges had a Democratic preference. The majority of contributions, 59%, went to Clinton; Sanders comes in second with 37%. Cruz came in third (1.43%) and Trump a close fourth (1.41%).
Below I will find the stats for homemakers and retirees.
# Homemaker stats
homemaker <- subset(az, grepl('HOMEMAKER', az$contbr_occupation, ignore.case = T ))
summary(homemaker)
## first_name clean_zip cmte_id cand_id
## Length:1076 Length:1076 C00575795:600 P00003392:600
## Class :character Class :character C00574624:146 P60006111:146
## Mode :character Mode :character C00577130:105 P60007168:105
## C00573519: 99 P60005915: 99
## C00580100: 84 P80001571: 84
## C00458844: 14 P60006723: 14
## (Other) : 28 (Other) : 28
## cand_nm contbr_nm
## Clinton, Hillary Rodham :600 BORCH, INGER : 39
## Cruz, Rafael Edward 'Ted':146 GUIDARELLI-AMBRAD, DEBORAH: 35
## Sanders, Bernard :105 FRANK, GLORIA : 29
## Carson, Benjamin S. : 99 FRANZ, ROBIN : 29
## Trump, Donald J. : 84 DOVER, RITA : 28
## Rubio, Marco : 14 BADE, KRISTI : 27
## (Other) : 28 (Other) :889
## contbr_city contbr_st contbr_zip contbr_employer
## SCOTTSDALE :212 AZ:1076 857507118: 39 N/A :531
## TUCSON :171 852533610: 35 HOMEMAKER :306
## PHOENIX :143 852043820: 29 RETIRED : 59
## GILBERT : 85 852951792: 29 NONE : 41
## MESA : 83 853021415: 28 NOT EMPLOYED: 39
## PARADISE VALLEY: 57 852543072: 27 MY CHILDREN : 25
## (Other) :325 (Other) :889 (Other) : 75
## contbr_occupation contb_receipt_amt
## HOMEMAKER :1028 Min. : -40.0
## UNEMPLOYED - HOMEMAKER : 25 1st Qu.: 25.0
## HOMEMAKER / PHOTOGRAPHER / MSW: 5 Median : 50.0
## HOMEMAKER/ACTIVIST/ARTIST : 5 Mean : 137.1
## HUSBAND/MECHANICWIFE/HOMEMAKER: 5 3rd Qu.: 100.0
## HOMEMAKER/PHYSICIAN : 3 Max. :2700.0
## (Other) : 5
## contb_receipt_dt
## 19-OCT-16: 14
## 03-NOV-16: 12
## 06-NOV-16: 12
## 09-OCT-16: 12
## 26-SEP-16: 12
## 04-NOV-16: 11
## (Other) :1003
## receipt_desc
## :1076
## * EARMARKED CONTRIBUTION: SEE BELOW REATTRIBUTION/REFUND PENDING: 0
## * REATTRIBUTED FROM EDWARD FARMILANT : 0
## * REATTRIBUTED TO BARBARA FAMILANT : 0
## * REATTRIBUTED TO VICTORIA STRONG : 0
## EVENT PLANNING REATTRIBUTION FROM SPOUSE : 0
## (Other) : 0
## memo_cd memo_text form_tp
## :906 :872 SA17A:912
## X:170 * EARMARKED CONTRIBUTION: SEE BELOW: 99 SA18 :164
## * HILLARY VICTORY FUND : 98 SB28A: 0
## *BEST EFFORTS UPDATE : 5
## * : 1
## EARMARKED FROM MAKE DC LISTEN : 1
## (Other) : 0
## file_num tran_id election_tp
## Min. :1014598 C5628470 : 2 : 4
## 1st Qu.:1077853 A105C04C73FFA4C859DB: 1 G2016:452
## Median :1109498 A6BF5A3EFECE4468B9E9: 1 O2016: 1
## Mean :1103419 A85C4E16099CC4E5F8A1: 1 P2016:619
## 3rd Qu.:1133930 AAA1CD0DBF8AB4B9281D: 1 P2020: 0
## Max. :1146165 AFCCA0974E8D949428D0: 1
## (Other) :1069
## proper_date party city
## Min. :2015-04-01 democrat :705 Length:1076
## 1st Qu.:2016-02-27 independent: 14 Class :character
## Median :2016-06-21 republican :357 Mode :character
## Mean :2016-05-23
## 3rd Qu.:2016-09-21
## Max. :2016-12-02
##
## state latitude longitude cand_gender
## Length:1076 Min. :31.49 Min. :-114.6 Female:602
## Class :character 1st Qu.:33.30 1st Qu.:-112.1 Male :474
## Mode :character Median :33.49 Median :-111.9
## Mean :33.37 Mean :-111.8
## 3rd Qu.:33.62 3rd Qu.:-111.7
## Max. :36.62 Max. :-109.4
##
## contrib_gender
## female:1035
## male : 41
##
##
##
##
##
# Percentage of homemaker genders
percent(homemaker$contrib_gender)
## female male N
## 96.189591 3.810409 1076.000000
# Homemaker contribs party leaning
percent(homemaker$party)
## democrat independent republican N
## 65.520446 1.301115 33.178439 1076.000000
# retired findings
retired <- subset(az, grepl('RETIRED', az$contbr_occupation, ignore.case = T))
#percentage of retirees in the data
print(
length(unique(retired$contbr_nm))/
length(unique(az$contbr_nm)) * 100)
## [1] 35.57178
# Retired party leaning percentage
print (summary(retired$party)/ length(retired$party) *100)
## democrat independent republican NA's
## 60.00657639 0.44540101 39.52410845 0.02391415
# Retired average spending
mean(retired$contb_receipt_amt)
## [1] 81.05961
Homemakers are 96% female, and 65% of homemaker contributions went to Democrats.
As we can see above, retirees make up about 35.6% of the data-set. Around 60% of retiree contributions went to Democrats and around 40% to Republicans; contributions to independents are negligible. Retirees contributed $81 on average.
I want to know which occupations are most politically active, and how they lean politically.
jobs <-data.matrix(summary(az$contbr_occupation))
head(jobs,10)
## [,1]
## RETIRED 32749
## NOT EMPLOYED 13737
## INFORMATION REQUESTED 3214
## ATTORNEY 2107
## PHYSICIAN 1897
## TEACHER 1821
## ENGINEER 1389
## CONSULTANT 1272
## PROFESSOR 1239
## SALES 1238
Setting aside retirees and the unemployed, the most politically active occupations in the data set are attorneys, physicians and teachers.
Below I would like to know the proportion of party leaning for each job. For example, of all engineers, what percentage lean Republican (the number of Republican engineers divided by the total number of engineers)?
require(GGally)
require(gmodels)
# Create a DF for jobs and their party count
job_party_table<- as.data.frame.matrix(table(az$contbr_occupation, az$party))
# Create a DF with jobs and their party percentages
job_party_dist<- prop.table(table(az$contbr_occupation, az$party), 1)
job_party_dist <- as.data.frame.table(job_party_dist)
# Merge the two DFs above
job_party_props<- data.frame(c(job_party_table, job_party_dist))
#Keep these cols
keep <- c('Var1','democrat', 'republican', 'Var2', 'Freq')
job_party_props <- job_party_props[,keep]
#change props to percentages
job_party_props$Freq <- job_party_props$Freq *100
# add a total_job col
job_party_props$total <- job_party_props$democrat + job_party_props$republican
#drop these cols
job_party_props$democrat <- NULL
job_party_props$republican <- NULL
# change Var2 col to party
names(job_party_props)[names(job_party_props) == 'Var2'] <- 'party'
# subset top 10 republican jobs
top10rep <-subset(job_party_props, party == 'republican' & total > 1000)
# create a pie chart
pie(top10rep$Freq, labels = top10rep$Var1)
#create a bar plot
ggplot(top10rep,
aes(Var1, Freq))+
geom_bar(stat = "identity")
# subset top 10 democrats jobs
top10demo <- subset(job_party_props, party == 'democrat' & total > 1000 &
                    Freq > quantile(job_party_props$Freq, .66, na.rm = TRUE))
# create a bar plot
ggplot(top10demo,
aes(Var1, Freq))+
geom_bar(stat = "identity")
# subset top 10 independents jobs
top10inde <-subset(job_party_props, party == 'independent' & Freq > quantile(Freq, probs = .62, na.rm = T) & total >100 & Var1 != 'N/A')
# create a bar plot
ggplot(top10inde,
aes(Var1, Freq))+
geom_bar(stat = "identity")
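The table()/prop.table() juggling above can also be sketched as a single dplyr pipeline, avoiding the intermediate data frames (assuming the az columns used throughout):

```r
# Job-by-party counts and within-job percentages in one pipeline
job_party <- az %>%
  filter(!is.na(party)) %>%
  count(contbr_occupation, party) %>%
  group_by(contbr_occupation) %>%
  mutate(total = sum(n), pct = 100 * n / total) %>%
  ungroup()
```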
I would like to see average spending over time.
#group by avg week
require(lubridate)
by_wk_sum <- tapply(az$contb_receipt_amt, week(az$proper_date), sum)
by_wk_avg <- tapply(az$contb_receipt_amt, week(az$proper_date), mean)
plot(by_wk_avg, type = 'l')
plot(by_wk_sum)
# note the weeks are aggregated by all years
az$week <-format(az$proper_date, format = "%W")
az$month <-format(az$proper_date, format = "%m")
az$year <-format(az$proper_date, format = "%y")
by_wk <- az %>% group_by(year = as.numeric(year), week= as.numeric(week), party) %>% summarise(sum = sum(contb_receipt_amt),
avg = mean(contb_receipt_amt),
n=n())
ggplot(subset(by_wk, avg> 0 & year > 14),
aes(as.numeric(week), avg, color=party, size=n))+ geom_line()+
facet_grid(~year)
It seems that the average contribution amount is huge at the beginning of 2015, but adding a fourth variable (n, the number of contributions) shows that those were a few outliers. The mass of contributions came in mid-2016, which lowered the average while the size (n) grew substantially.
library(choroplethrZip)
library(choroplethr)
require(glue)
data(df_zip_demographics)
data("zip.regions")
zip.regions
az_demographics <- subset(zip.regions, region > 85001 & region < 86556)
# create a function that selects zipcode and region for zip_choropleth analysis
clean <- function(x, y){
  new <- data.frame(region = x, value = y)
  new[!duplicated(new$region), ]
}
# group by other interesting variables
zip_avg_sum <- az %>%
group_by(clean_zip, party) %>%
summarise(
average_contrib = mean(contb_receipt_amt),
sum_of_contribs = sum(contb_receipt_amt))
# Democratic contrib heatmap
demo_zip_avg_sum <- na.omit(zip_avg_sum[zip_avg_sum$party == 'democrat',])
demo_zip_avg_sum <- clean(demo_zip_avg_sum$clean_zip, demo_zip_avg_sum$sum_of_contribs)
demo_heat <- zip_choropleth(demo_zip_avg_sum,state_zoom = 'arizona', county_zoom = 4013 )+
ggtitle('Maricopa County Total USD Contribution (by Zipcode)', subtitle = 'Democratic Party Candidates')+
labs(color= 'USD Amount')
# Republican contrib heatmap
repub_zip_avg_sum <- na.omit(zip_avg_sum[zip_avg_sum$party == 'republican',])
repub_zip_avg_sum <- clean(repub_zip_avg_sum$clean_zip, repub_zip_avg_sum$sum_of_contribs)
repub_heat <- zip_choropleth(repub_zip_avg_sum,state_zoom = 'arizona', county_zoom = 4013)+
ggtitle(' ',subtitle = 'Republican Party Candidates')
# TODO: figure out how to change the color scale to red
# Problem: the two heatmaps' color scales do not match
grid.arrange(demo_heat, repub_heat)
# Research heatmap color and scale
# Final plot 3
require(choroplethrMaps)
df_zip_demographics <- subset(df_zip_demographics, region > 85001 & region < 86556)
ALLdemographics_df<- merge.data.frame(df_zip_demographics, zip.regions, by.x='region', by.y='region')
ALLdemographics_df
az_county<- merge.data.frame(az,ALLdemographics_df , by.x='clean_zip', by.y = 'region')
az_county
az_county_summ <- az_county %>%
group_by(county.name, county.fips.numeric, party) %>%
summarise(n= n(),
average_contrib = mean(contb_receipt_amt),
sum_of_contribs = sum(contb_receipt_amt))
# Clean avg_contribution
az_countyP1 <- with(az_county_summ, clean(county.fips.numeric, average_contrib))
az_countyP1 <- na.omit(az_countyP1)
# Plot avg_contribution by county
county_choropleth(az_countyP1, state_zoom = 'arizona')
# group by county & party
az_county_party <- merge(x = az_county, y = az[, c("party","tran_id" )], by = "tran_id", all.x=TRUE)
az_county_party_grouped <- az_county_party %>%
group_by(county.fips.numeric, party.x) %>%
summarize(average = mean(contb_receipt_amt))
# Plot demo avg_contribution by county
az_countyP1Demo <- with(az_county_party_grouped[az_county_party_grouped$party.x == 'democrat',] , clean(county.fips.numeric, average))
P1Demo <-county_choropleth(na.omit(az_countyP1Demo), state_zoom = 'arizona')+
ggtitle(label = 'Democratic Average Contribution', subtitle = 'By County')
# Plot repub avg_contribution by county
az_countyP1Repub <- with(az_county_party_grouped[az_county_party_grouped$party.x == 'republican',] , clean(county.fips.numeric, average))
P1Repub <- county_choropleth(na.omit(az_countyP1Repub), state_zoom = 'arizona')+
ggtitle(label = 'Republican Average Contribution', subtitle = 'By County')
# Arrange the party plots
grid.arrange(P1Demo, P1Repub)
Note: there is a discrepancy in the color scales.
Brainstorming: what can I do to improve? What kinds of graphs could I add? A bar chart of the jobs contributing most to Trump.
I wonder what kinds of jobs contributed to Donald Trump; my intuition says mostly blue-collar jobs. Let's find out!
# Bar chart of the jobs contributing most to Trump
Trump_jobs<- az[az$cand_nm == "Trump, Donald J.",] %>%
group_by(job=contbr_occupation) %>%
summarise(avg = mean(contb_receipt_amt),
n = n())
ggplot(Trump_jobs[Trump_jobs$n >100,], aes(x=job, y=avg))+
geom_bar(stat = 'identity')
# Bar chart of the jobs contributing most to Clinton
Clinton_jobs<- az[az$cand_nm == "Clinton, Hillary Rodham",] %>%
group_by(job=contbr_occupation) %>%
summarise(avg = mean(contb_receipt_amt),
n = n())
ggplot(Clinton_jobs[Clinton_jobs$n >390 & Clinton_jobs$avg > 0 ,], aes(x=job, y=avg))+
geom_bar(stat = 'identity')
unique(az$cand_nm)
## [1] Trump, Donald J. Sanders, Bernard
## [3] Cruz, Rafael Edward 'Ted' Clinton, Hillary Rodham
## [5] Stein, Jill Carson, Benjamin S.
## [7] Paul, Rand Fiorina, Carly
## [9] Rubio, Marco Johnson, Gary
## [11] Bush, Jeb Kasich, John R.
## [13] Santorum, Richard J. McMullin, Evan
## [15] Webb, James Henry Jr. Huckabee, Mike
## [17] Walker, Scott Christie, Christopher J.
## [19] Jindal, Bobby O'Malley, Martin Joseph
## [21] Lessig, Lawrence Graham, Lindsey O.
## [23] Perry, James R. (Rick)
## 24 Levels: Bush, Jeb Carson, Benjamin S. ... Webb, James Henry Jr.
My hypothesis is false: most of Trump's contributors have white-collar jobs, even the higher-income types such as engineers, consultants, physicians and CEOs. One weakness of this plot is that it does not represent low-income contributors.
Let me see whether looking at the number of contributions alone helps me find out more.
ggplot(Trump_jobs[Trump_jobs$n >100 & Trump_jobs$n < 2000,], aes(x=job, y=n))+
geom_bar(stat = 'identity')
sum(Trump_jobs$n)/sum(cand_sum$n)
## [1] 0.1344482
ggplot(Trump_jobs[Trump_jobs$n >20 & Trump_jobs$n < 100,], aes(x=job, y=n))+
geom_bar(stat = 'identity')+
theme(axis.text.x=element_text(angle=90,hjust=1))
Contributions to Trump make up about 13.4% of the whole data-set.
Even after changing some of the subset filters, the majority were still high-income occupations. There are some blue-collar jobs, such as truck drivers and construction workers, but they were a minority. My hypothesis is that blue-collar workers cannot afford to contribute and are therefore underrepresented in this data-set.
I want to see if higher income zip-codes had more contributions and I will use a scatter-plot to demonstrate.
# final plots 2
# merge demographic data with original dataset
AZ_ALLdemographics_df <- merge.data.frame(az, ALLdemographics_df, by.x = 'clean_zip', by.y = 'region')
income_cotrib <- AZ_ALLdemographics_df %>% group_by(clean_zip, income= per_capita_income)%>% summarise(n = n(), total= sum(contb_receipt_amt),
average = mean(contb_receipt_amt))
ggplot(income_cotrib[income_cotrib$n >100,], aes(income, total))+
geom_point()+
geom_smooth()+
xlab('Per Capita Income per Zipcode')+
ylab('Total Contributions (USD) per Zipcode')+
ggtitle('Relationship Between Total Contributions and Per Capita Income', subtitle = 'Per Zipcode')
## `geom_smooth()` using method = 'loess'
ggplot(income_cotrib[income_cotrib$n >100,], aes(income, average))+
geom_point()+
geom_smooth()+
xlab('Per Capita Income per Zipcode')+
ylab('Average Contribution per Zipcode')+
ggtitle('Relationship Between Average USD Amount of Contributions and Per Capita Income', subtitle = 'Per Zipcode')
## `geom_smooth()` using method = 'loess'
The strong relationship only appears once I subset the data to zip codes with at least 100 contributions; otherwise the low-count zip codes skew the data and the relationship is not apparent.
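One way to quantify that claim is to compare the income/total correlation before and after the `n > 100` filter. A sketch with toy data standing in for `income_cotrib` (columns `income`, `total`, `n` assumed from the chunk above):

```r
# Hypothetical sketch: how the n > 100 filter changes the correlation.
# Toy data stands in for income_cotrib; the low-n row is a deliberate outlier.
ic <- data.frame(income = c(22, 31, 44, 58, 73, 95) * 1000,
                 total  = c(9000, 800, 1500, 2800, 4600, 7500),
                 n      = c(3, 40, 150, 320, 480, 900))
cor_all  <- cor(ic$income, ic$total)
cor_filt <- with(subset(ic, n > 100), cor(income, total))
round(c(all = cor_all, filtered = cor_filt), 2)
```

On the toy data the filtered correlation is far stronger, which is the same pattern the plots above suggest for the real data.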
This is obvious, but I would still like to see the relationship between the number of contributions and the population of each zip code.
#final plot 1
pop_contrib <- AZ_ALLdemographics_df %>%
group_by(clean_zip, total_population)%>%
summarize(n=n())
ggplot(pop_contrib, aes(n,total_population))+
geom_point()+
geom_smooth()+
xlim(0,1000)+
xlab('Number of Contributions')+
ylab('Total Population')+
ggtitle('Relationship Between Number of Contributions and Population per Zipcode')
## `geom_smooth()` using method = 'loess'
## Warning: Removed 32 rows containing non-finite values (stat_smooth).
## Warning: Removed 32 rows containing missing values (geom_point).
There is a strong correlation at first, but as population increases the relationship weakens.
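That weakening can be checked numerically by splitting the zip codes at the median population and comparing the correlation within each half. Again a sketch with toy data standing in for `pop_contrib` (columns `n`, `total_population` assumed):

```r
# Hypothetical sketch: does the contributions/population link weaken
# for large zip codes? Toy data stands in for pop_contrib.
pc <- data.frame(n = c(20, 45, 80, 150, 400, 420, 380, 900),
                 total_population = c(2000, 4500, 8000, 15000,
                                      40000, 55000, 70000, 90000))
cut   <- median(pc$total_population)
small <- subset(pc, total_population <= cut)
large <- subset(pc, total_population >  cut)
c(small = cor(small$n, small$total_population),
  large = cor(large$n, large$total_population))
```

On the toy data the small-population half correlates more tightly than the large-population half, mirroring the pattern in the plot.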
finalp1<- ggplot(subset(na.omit(by_wk), avg> 0 & year > 14),
aes(as.numeric(week), avg, color=party, size=n))+ geom_line()+
facet_grid(~year)+
ggtitle('Average USD Contributed Along 2015/16 Weeks by Party')+
xlab('Week #')+
ylab('USD Contributed (Average)')+
labs(color= 'Party', size = 'Number of Contributions')
finalp1
finalp2 <- ggplot(pop_contrib, aes(n,total_population))+
geom_point()+
geom_smooth()+
xlim(0,1000)+
xlab('Number of Contributions')+
ylab('Total Population')+
ggtitle('Relationship Between Number of Contributions and Population per Zipcode')
finalp2
## `geom_smooth()` using method = 'loess'
## Warning: Removed 32 rows containing non-finite values (stat_smooth).
## Warning: Removed 32 rows containing missing values (geom_point).
finalp3 <- ggplot(income_cotrib[income_cotrib$n >100,], aes(income, total))+
geom_point()+
geom_smooth()+
xlab('Per Capita Income per Zipcode')+
ylab('Total Contributions (USD) per Zipcode')+
ggtitle('Relationship Between Total Contributions and Per Capita Income', subtitle = 'Per Zipcode')
finalp3
## `geom_smooth()` using method = 'loess'
finalp4 <- ggplot(income_cotrib[income_cotrib$n >100,], aes(income, average))+
geom_point()+
geom_smooth()+
xlab('Per Capita Income per Zipcode')+
ylab('Average Contribution per Zipcode')+
ggtitle('Relationship Between Average USD Amount of Contributions and Per Capita Income', subtitle = 'Per Zipcode')
finalp4
## `geom_smooth()` using method = 'loess'
finalp5 <- county_choropleth(az_countyP1, state_zoom = 'arizona')
finalp5
finalp6 <- grid.arrange(P1Demo, P1Repub)
finalp6
## TableGrob (2 x 1) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (2-2,1-1) arrange gtable[layout]
finalp7 <- grid.arrange(demo_heat, repub_heat)
Overall, this project was a good challenge and a learning experience. At first it was easy and enjoyable to explore the data, but as I went deeper into the analysis it became harder to come up with relationships and conclusions. I wanted my analysis to have a central theme/thesis, and not being able to draw a definite conclusion was frustrating.
I was impressed with the versatility of R and its packages; it felt more intuitive than Python, perhaps because of my background with Alteryx. R seemed to have less Stack Overflow support than Python, but there is support nonetheless, and it aided me significantly throughout the project. I also used DataCamp to fill in knowledge gaps and reinforce the concepts from the Udacity curriculum. I did not use Udacity's live help as much as in the other projects, because my problems were not with the programming itself but rather a loss of ideas and direction in my analysis.
In terms of visualizations, R is fantastic for data exploration, though I found exporting high-resolution plots less convenient. I feel that Tableau is more suitable for final, conclusive plots.
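That said, ggplot2 can write reasonably high-resolution files via `ggsave()`. A minimal sketch (the plot and filename here are placeholders, not from this analysis):

```r
library(ggplot2)

# Minimal export sketch: write a 300-dpi PNG of a throwaway plot.
# "example_plot.png" is a hypothetical filename.
p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
ggsave("example_plot.png", plot = p, width = 8, height = 5, dpi = 300)
```

Whether the output matches Tableau's polish for presentation-grade plots is, of course, a matter of taste.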